On Feature Selection: A New Filter Model

Author

  • Marc Sebban
Abstract

We focus on the filter approach of feature selection. We exploit geometrical information of the learning set to build an estimation criterion based on a quadratic entropy. The distribution of this criterion is approximately normal, which allows the construction of a non-parametric statistical test to assess the relevance of feature subsets. We use the critical threshold of this test, called the test of Relative Certainty Gain, in a forward selection algorithm. We present some experimental results, both on synthetic and natural domains of the UCI database repository, which show significant improvements in the accuracy estimates.

Introduction

While the problem of feature selection has always been at the center of statistics research, it is only recently that it received attention in computer science. Beyond the intention of improving the performance of their algorithms, machine learning researchers have studied feature selection methods to face the explosion of data (not always relevant) provided by recent data collecting technologies (the Web, for instance). From a theoretical standpoint, the selection of a good feature subset is of little interest: a Bayesian classifier is monotonic, i.e., adding features cannot decrease the model's performance. However, this is generally true only for infinite learning sets, for which estimation errors can be ignored. In practice, algorithms not always being perfect, the monotonicity assumption rarely holds (Kohavi 1994). Thus, irrelevant or weakly relevant features may reduce the accuracy of the model. A study in (Thrun et al. 1991) shows that, with the C4.5 algorithm (Quinlan 1993), not deleting weakly relevant features generates deeper decision trees with lower performances than those obtained without these features. In (Aha 1992), the author shows that the storage of the IB3 algorithm increases exponentially with the number of irrelevant features. The same sort of conclusions are presented in (Langley and Iba 1993). These results have encouraged scientists to elaborate sophisticated feature selection methods allowing them to:

  • Reduce the classifier's cost and complexity.
  • Improve the model's accuracy.
  • Improve the visualization and comprehensibility of induced concepts.

According to the terminology proposed in (John, Kohavi and Pfleger 1994), two approaches are available: the wrapper and filter models. In filter models, the accuracy of the future induced classifier is assessed using statistical techniques, and the method "filters out" irrelevant features before the induction process. In wrapper methods, we search for a good subset of features using the induction algorithm itself; the principle is generally based on the optimization of the accuracy rate, estimated by one of the following methods: holdout, cross-validation (Kohavi 1995), or bootstrap (Efron and Tibshirani 1993). Whatever the feature selection method we use, the goal is always to assess the relevance of alternative subsets. A survey of relevance definitions is proposed in (Blum and Langley 1997). In this article, we consider the filter approach to find relevant features, and we explain in detail the arguments for this choice. We exploit the characteristics of a neighborhood graph built on the learning set to compute a new estimation criterion based on a quadratic entropy.
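The full construction is only summarized in this excerpt, but the shape of such a criterion can be sketched. The following is a minimal illustration, not the paper's exact formulas: it assumes a k-nearest-neighbor graph, a plain Gini form of the quadratic entropy, and a simple normalization, and it scores a feature subset by how much the neighborhood structure reduces class uncertainty.

```python
# Minimal sketch (assumptions, not Sebban's exact formulas): score a
# feature subset by how much a k-NN neighborhood graph built in that
# subspace reduces the quadratic (Gini) entropy of the class labels.
import numpy as np

def quadratic_entropy(labels, classes):
    """Quadratic entropy 1 - sum_j p_j^2 of the class frequencies."""
    p = np.array([np.mean(labels == c) for c in classes])
    return 1.0 - np.sum(p ** 2)

def knn_graph(X, k=5):
    """Indices of the k nearest neighbors of each point (Euclidean)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

def relative_certainty_gain(X, y, k=5):
    """Relative drop in class uncertainty inside the neighborhoods,
    compared with the prior uncertainty of the whole learning set."""
    classes = np.unique(y)
    prior = quadratic_entropy(y, classes)
    if prior == 0.0:                     # a single class: nothing to gain
        return 0.0
    neighbors = knn_graph(X, k)
    local = np.mean([quadratic_entropy(y[nb], classes) for nb in neighbors])
    return (prior - local) / prior
```

Intuitively, if neighbors in the chosen subspace tend to share their class label, the local entropy falls well below the prior one and the gain approaches 1.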
We show that the distribution of this criterion is approximately normal, which allows the construction of a non-parametric test to assess the quality of feature subsets. We use this statistical test (more precisely, its critical threshold) in a forward selection algorithm. Finally, we present some experimental results on benchmarks of the UCI database repository, comparing the performances of the selected feature subsets with the results obtained in the original spaces.

Feature Selection and Filter Model
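A hedged sketch of the forward selection loop just described, reusing `relative_certainty_gain` from the sketch above; the fixed improvement threshold below is only a stand-in for the critical value of the test of Relative Certainty Gain, whose exact statistic this excerpt does not specify:

```python
# Greedy filter-style forward selection: at each step add the feature
# whose inclusion most increases the criterion; stop when the best
# candidate no longer passes the acceptance threshold. `threshold`
# stands in for the critical value of the paper's statistical test.
def forward_selection(X, y, criterion, threshold=0.01):
    selected = []
    remaining = list(range(X.shape[1]))
    current = 0.0
    while remaining:
        scores = {f: criterion(X[:, selected + [f]], y) for f in remaining}
        best = max(scores, key=scores.get)
        if scores[best] - current < threshold:   # no significant gain: stop
            break
        selected.append(best)
        remaining.remove(best)
        current = scores[best]
    return selected

# e.g. subset = forward_selection(X, y, relative_certainty_gain)
```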

Similar Articles

A New Hybrid Framework for Filter based Feature Selection using Information Gain and Symmetric Uncertainty (TECHNICAL NOTE)

Feature selection is a pre-processing technique used to eliminate irrelevant and redundant features, which enhances the performance of classifiers. When a dataset contains many irrelevant and redundant features, accuracy fails to increase and classifier performance degrades. To avoid this, this paper presents a new hybrid feature selection method usi...
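The two measures named in the title have standard definitions for discrete variables; a minimal sketch follows (the hybrid combination itself is specific to the paper and is not reproduced here):

```python
# Standard definitions of information gain and symmetric uncertainty
# for discrete features, the two filter scores named in the title.
import numpy as np

def entropy(x):
    """Shannon entropy H(X) of a discrete variable, in bits."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    """IG(Y; X) = H(Y) - H(Y | X)."""
    h_cond = 0.0
    for v in np.unique(x):
        mask = (x == v)
        h_cond += mask.mean() * entropy(y[mask])
    return entropy(y) - h_cond

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(Y; X) / (H(X) + H(Y)), normalized to [0, 1]."""
    denom = entropy(x) + entropy(y)
    return 2.0 * information_gain(x, y) / denom if denom > 0 else 0.0
```

Symmetric uncertainty compensates for information gain's bias toward many-valued features by normalizing with the two marginal entropies.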

Developing a Filter-Wrapper Feature Selection Method and its Application in Dimension Reduction of Gene Expression

Nowadays, the increasing volume of data and the number of attributes in datasets reduce the accuracy of learning algorithms and raise the computational complexity. One dimensionality reduction technique is feature selection, which is done through filtering or wrapping. Wrapper methods are more accurate than filter methods, but filter methods run faster and carry a lower computational burden. With ...
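A generic filter-then-wrapper pipeline of the kind this abstract alludes to might look as follows; this is a sketch only, and the estimator, the mutual-information filter, and the cut-off are illustrative choices rather than the paper's method:

```python
# Filter-then-wrapper compromise: a cheap filter (mutual information)
# prunes the feature space, then a wrapper (cross-validated accuracy)
# makes one greedy pass in filter-rank order, keeping a feature only
# if it improves the CV score.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def filter_then_wrap(X, y, keep=20, cv=5):
    # Filter stage: rank all features, keep only the top `keep`.
    ranking = np.argsort(mutual_info_classif(X, y))[::-1][:keep]
    # Wrapper stage: greedy pass over the surviving features.
    model, selected, best = KNeighborsClassifier(), [], 0.0
    for f in ranking:
        candidate = selected + [int(f)]
        score = cross_val_score(model, X[:, candidate], y, cv=cv).mean()
        if score > best:
            selected, best = candidate, score
    return selected, best
```

The filter stage keeps the wrapper's search space small, which is the usual compromise between the speed of filters and the accuracy of wrappers.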

Fuzzy-rough Information Gain Ratio Approach to Filter-wrapper Feature Selection

Feature selection for various applications has been carried out for many years in many different research areas. However, there is a trade-off between finding feature subsets with minimum length and increasing the classification accuracy. In this paper, a filter-wrapper feature selection approach based on fuzzy-rough gain ratio is proposed to tackle this problem. As a search strategy, a modifie...

Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features

Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...
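Taking the title at face value, the geometric series of a feature adjacency matrix accumulates direct and indirect associations: A captures one-step relations between features, A² two-step relations, and so on. A hedged sketch of that accumulation follows; the damping factor and truncation depth are assumptions, since the paper's actual construction is not shown in this excerpt:

```python
# Truncated geometric (Neumann) series of a damped adjacency matrix:
# entry (i, j) of the result reflects both direct and multi-step
# associations between features i and j.
import numpy as np

def adjacency_series(A, alpha=0.5, terms=10):
    """Sum of (alpha * A)^t for t = 1..terms; converges when
    alpha * ||A|| < 1."""
    S = np.zeros(A.shape)
    P = np.eye(A.shape[0])
    for _ in range(terms):
        P = alpha * (P @ A)       # next power of the damped adjacency
        S += P
    return S
```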

Feature Selection Using Multi Objective Genetic Algorithm with Support Vector Machine

Different approaches have been proposed for feature selection to obtain a suitable feature subset among all features. These methods search the feature space for feature subsets that satisfy some criteria or optimize several objective functions. The objective functions are divided into two main groups: filter and wrapper methods. In filter methods, feature subsets are selected according to some measu...

Bridging the semantic gap for software effort estimation by hierarchical feature selection techniques

Software project management is one of the significant activities in the software development process. Software Development Effort Estimation (SDEE) is a challenging task in software project management. SDEE is an old activity in the computer industry, dating from the 1940s, and has been reviewed several times. An SDEE model is appropriate if it provides accuracy and confidence simultaneously before softwa...


Journal title:

Volume   Issue

Pages  -

Publication date: 1999